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DETECTION OF EPIGENOMIC NETWORK COMMUNITY 

ONCOMARKERS 

By Thomas E. Bartlett*’ 1 ^, and Alexey Zaikin^ 

University College London § 

In this paper we propose network methodology to infer prognostic 
cancer biomarkers based on the epigenetic pattern DNA methylation. 
Epigenetic processes such as DNA methylation reflect environmen¬ 
tal risk factors, and are increasingly recognised for their fundamental 
role in diseases such as cancer. DNA methylation is a gene-regulatory 
pattern, and hence provides a means by which to assess genomic reg¬ 
ulatory interactions. Network models are a natural way to represent 
and analyse groups of such interactions. The utility of network mod¬ 
els also increases as the quantity of data and number of variables 
increase, making them increasingly relevant to large-scale genomic 
studies. We propose methodology to infer prognostic genomic net¬ 
works from a DNA methylation-based measure of genomic interaction 
and association. We then show how to identify prognostic biomarkers 
from such networks, which we term ‘network community oncomark- 
ers’. We illustrate the power of our proposed methodology in the 
context of a large publicly available breast cancer dataset. 


1. Introduction. Complex systems which can be modelled as networks 
are ubiquitous. Well-known examples include social/communication net¬ 
works (Beguerisse-Diaz et ah, 2014) and economic networks (Saavedra et ah, 
2014), as well as many others in the biological sciences such as ecological 
networks (Nandi, Sumana and Bhattacharya, 2014), gene networks (Wei 
and Pan, 2010; Li and Wang, 2014), protein networks (Mardia, 2013; Tran 
and Kwon, 2013), and metabolic networks (Reznik, Watson and Chaudhary, 
2013). Over the past few years in cell biology, much focus has shifted from 
investigation of individual genes, to pathways of genes, to gene networks. 
The interest in novel methodology for network analysis in cell biology fol¬ 
lows from the recognition that examining the way genes work in groups often 
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yields more accurate inference of biological processes. 

The problem of finding community structure in networks has been studied 
for many years. Important applications of this problem include identifying 
groups of friends or co-workers in social networks, as well as identifying 
functional subnetwork modules in biological networks (Girvan and New¬ 
man, 2002). In the biological setting, genes can be viewed as acting together 
as part of ‘subnetwork modules’, which are functional units with specific 
biological roles (Shen-Orr et al., 2002). Indeed, it has been demonstrated 
recently that such modularity is a natural and even inevitable result of 
evolutionary pressures (Clune, Mouret and Lipson, 2013). This is because 
modularity minimises network connectivity cost whilst maximising perfor¬ 
mance, and thus it represents the most parsimonious and efficient type of 
network structure for biological networks such as these. Furthermore, consid¬ 
ering groups of genes defined together as subgraphs can lead to big increases 
in statistical power, aiding discovery of biological phenomena (Jacob et al., 
2012; Li and Li, 2010; Peng et al., 2010). Therefore, it is relevant to both the 
biological and statistical modelling to consider the group behaviour of genes 
in this way. Hence, this viewpoint of modular genomic network structure is 
fundamental to the methodology we propose here. 

Epigenetic patterns are gene-regulatory patterns, meaning that they influ¬ 
ence the activity of particular genes, among other phenomena (Jones, 2012). 
Epigenetic information can be modulated during the lifetime of an organ¬ 
ism by environmental cues (Feinberg, Ohlsson and Henikoff, 2006; Cooney, 
2007; Christensen et al., 2009). As such, epigenetics can be considered to 
be an interface between the genome and the environment, and consequently 
also a conduit for environmental risk factors. Alterations in the epigenetic 
pattern DNA methylation are among the earliest changes in human carcino¬ 
genesis (Feinberg, Ohlsson and Henikoff, 2006), and hence DNA methylation 
patterns are expected to yield important prognostic information useful for 
biomarker development. DNA methylation patterns are thought promising 
for biomarker development in a wide variety of physiological systems and 
organs (Verschuur-Maes, de Bruin and van Diest, 2012; Van Hoesel et al., 
2013; Fleischer et al., 2014; Kishida et al., 2012; Gao et al., 2013; Kang 
et al., 2001, 2003; Bhagat et al., 2012; Yamamoto et al., 2012; Luo et al., 
2014; Navarro et al., 2012; Maekawa et al., 2013). 

It is well established that DNA methylation plays an important role in 
gene regulation, and hence DNA methylation patterns often reflect gene reg¬ 
ulatory behaviour (Jones, 2012). Changes in DNA methylation are highly 
stochastic. The timescale over which these changes take place is much faster 
than DNA mutations can arise, but much slower than the transient and 
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periodically varying activity of individual genes, and this timescale is ideal 
for biomarker development. DNA methylation data are extremely noisy; 
however, statistics which summarise DNA methylation patterns at the gene 
level have been shown to have much utility as analytical tools (Bartlett 
et ah, 2013). It has been shown previously that DNA methylation can serve 
as a surrogate measure of genomic-regulatory action (Brocks et al., 2014). 
Hence, DNA methylation measurements are a natural basis from which to 
construct genomic regulatory and related networks. As a cancer progresses, 
its signalling and control networks are rearranged (‘rewired’), leading to ge¬ 
nomic changes which are advantageous for the cancer (Barabasi and Oltvai, 
2004). Previous research has found that patient survival outcome in breast 
cancer can be predicted well by network models of this rewiring, based on 
gene expression data (Taylor et ah, 2009). Hence, network models based on 
DNA methylation measurements are a very promising basis for the develop¬ 
ment of prognostic biomarkers. 

Statistical network models are a parsimonious way to represent and anal¬ 
yse large numbers of variables and samples. They are efficient analytical 
tools appropriate for the very large datasets which are produced by the lat¬ 
est technologies in cell biology. When carrying out modelling of this type, 
it is important to balance statistical fidelity with computational efficiency. 
The ‘stochastic blockmodeh (SBM) (Holland, Laskey and Leinhardt, 1983; 
Bickel and Chen, 2009) is an efficient network model which has been widely 
studied and is well understood, and hence it is a good basis for our pro¬ 
posed methodology. Under the SBM, there is a greater probability of ob¬ 
serving an edge (or interaction) between a pair of nodes if they are in the 
same block, or community. The Newman-Girvan modularity (Newman and 
Girvan, 2004) quantifies the extent to which network edges are observed 
between community members, for a particular assignment of nodes to com¬ 
munities, compared to the expected number of edges between community 
members if there were no community structure present. It can be shown that, 
under certain conditions, fitting the stochastic blockmodel is equivalent to 
maximising the Newman-Girvan modularity over a network, and that these 
are both equivalent to spectral clustering (Riolo and Newman, 2012; Bickel 
and Chen, 2009). We use spectral clustering as an efficient computational 
algorithm for fitting the SBM. 

It has also been shown recently that, under reasonable assumptions, the 
SBM can be used to represent any network as a ‘network histogram’, what¬ 
ever the generating mechanism of that network. Further, the network his¬ 
togram provides a heuristic method to estimate the optimum number of 
blocks, or clusters, which a valid blockmodel representation of the net- 
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work may contain. This is important and useful, because it means that 
the blockmodel can be used to identify an unknown number of communi¬ 
ties, or functional subnetwork modules, in a biological network. Genomic 
networks are typically scale-free, which means that they exhibit a power- 
law degree distribution (Wagner, 2002). Further, they are thought to be 
hierarchical (Barabasi and Oltvai, 2004; Palla, Lovasz and Vicsek, 2010), 
displaying multi-scale properties. This means that different functional or¬ 
ganisation is visible at different granularities, or scales. We use the network 
histogram method (Olhede and Wolfe, 2014) to estimate the optimal granu¬ 
larity at which to identify communities, or functional subnetwork modules, 
in our prognostic networks by fitting the SBM. 

The main contribution of this work is to propose a well-integrated, and 
well-validated, statistical methodology for detecting biomarkers from the bi¬ 
ological viewpoint of modular genomic network structure, using DNA-based 
measurements of genomic regulatory patterns. To do this, we show how to 
integrate our previously proposed DNA methylation-based measure of in¬ 
teraction or association between pairs of genes, the ‘DNA methylation net¬ 
work interaction measure’ (Bartlett, Olhede and Zaikin, 2014), into a multi¬ 
stage pipeline to construct prognostic network community-based biomarkers. 
This leads to our novel and generally applicable statistical methodology; we 
present the multiple stages of this methodology sequentially here, and thus 
this paper is organised as follows. In Section 2, we outline our previously 
proposed DNA methylation network interaction measure (Bartlett, Olhede 
and Zaikin, 2014), and we show how to use this measure to infer prog¬ 
nostic genomic networks. An edge between a pair of genes/nodes in these 
networks indicates that the strength of interaction or association between 
those genes is associated with disease progression. Also, in Section 2, we 
show how to identify prognostic biomarkers from such networks, using com¬ 
munity detection to identify subnetwork modules within the network. These 
communities are groups of nodes/genes among which there is a high density 
of prognostic interactive or associative behaviour, and we term them ‘net¬ 
work community oncomarkers’. In Section 3, we demonstrate the utility of 
our proposed methodology in the context of a large, publicly available breast 
cancer dataset. To do so, we use each network community oncomarker to 
calculate a one-number prognostic score for each patient, and we use these 
scores to classify patients one by one into prognostic groups. Also in Section 
3, we show that among the genes of the network community oncomark¬ 
ers, the DNA methylation network interaction measure is associated with 
co-regulatory behaviour as measured by gene expression, justifying these 
findings in terms of biological function. 
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2. Proposed methodology. An overview of our proposed methodol¬ 
ogy appears in Figure 1, following which component parts of this method¬ 
ology are described in detail. 



Fig 1: Overview of methods. 

We note that, in principle, each of the steps illustrated in Figure 1 could be 
replaced with alternative choices of methodology. 

2.1. DNA Methylation Network Interaction Measure. DNA methylation 
is a chemical modification to DNA which may occur at numerous locations 
within a gene: the pattern of these modifications within a gene forms a ‘DNA 
methylation profile’. Using canonical correlation analysis (CCA) (Hotelling, 
1936) we previously proposed a statistic (Bartlett, Olhede and Zaikin, 2014) 
which measures the strength of interaction or association between a pair of 
genes (network nodes) in a single sample/patient, based on DNA methyla¬ 
tion profiles (Figure 2). This statistic quantifies the extent to which the DNA 
methylation profiles of a pair of genes explain each other. It is based only 
on measurements of the DNA methylation profiles of that pair of genes, and 
it acts as a surrogate for a measure of the extent to which this pair of genes 
behave interactively or associatively. Such behaviour may include transcrip¬ 
tional regulation or co-regulation, or other types of biochemical interaction, 
influencing gene expression levels, isoforms and the presence of alternatively 
spliced gene products, among other phenomena (Jones, 2012). The details 
of this DNA methylation network interaction measure are as follows. 
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The DNA methylation network interaction measure is defined by analogy 
to CCA. CCA aims to discover linear combinations of variables of one type, 
and linear combinations of variables of another type, so that these combina¬ 
tions best explain each other. In this context, a particular way of combining 
(by scaling and adding) the deviations from the mean methylation profile 
at a number of locations within one gene might be particularly effective at 
explaining a particular combination of (again, by scaling and adding) the 
deviations from the mean methylation profile at a number of locations in 
another gene, and vice versa. There will probably be fewer ways in which the 
methylation levels of these genes covary across the samples than there are 
locations at which methylation is measured along the genes; this is because 
the methylation level is highly correlated at many locations along a partic¬ 
ular gene. CCA finds the most important components of this covariation 
across samples. 

CCA seeks to find the vectors a and b, in the p and q dimensional spaces 
of variables X = (xi,X 2 , —,x p )' and Y = (yi,U 2 , respectively, which 

maximise the correlation p = cor (a'X, b 7 Y) defined according to equation 
1 : 

m n _ _ a'Syy-b _ 

U P ~ Va'SxxavVXyyb’ 

where 

E* X =E [(X-/* x )(X-/**)'] 

and 

Eyy=E[(Y-/i y )(Y-/* y )'] 
are the covariance matrices of X and Y respectively, 

V XY =E[(X-/1 A -)(Y-/Xy)'] 

is the cross-covariance matrix of X and Y, and \x x and fiy are the mean 
vectors of X and Y. 

Two genes X and Y have corresponding methylation profiles which are 
measured for sample / patient k at p and q CpGs (loci) respectively along 
these genes. Denoting these measurements by the variables x±, ...x p and 
yi, ...,y q for genes X and Y respectively, the DNA methylation profiles for 
these genes, for patient k, can be represented by the vectors x(fc) and y (k), 
which have p and q entries respectively. A measure of DNA methylation 
network interaction p X Y(k), of the methylation profiles of genes X and Y 
for sample k, can then be defined by analogy with equation 1, according to 
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equation 2: 


( 2 ) 


Pxv(k) 


_x c (fc) T S?ry c (fc)_ 

\J x. c (k) T £xx xC (k)\/y c {k) T £yly c (k) 


where Syy and are estimated from healthy rather than cancer 

samples in the methylation data set, according to equations 3-5, 


(3) ±xx = ^ X/ ( x ( fc ) “ Vx) (x(fc) - Ax } ) 

^ fcG healthy 


( 4 ) Syy = ^~ (y( k ) ~ Ay } ) (y(*0 - Ay } ) 

^ /cG healthy 


(5) sfy = -j- ^ (x(fc) - A?) (y(*0 - Ay } ) , 

^ &G healthy 

where 

= i £ x < fc )' 

fcG healthy 

= ^ £ y(*h 

/cG healthy 

rih is the number of healthy samples in the data set, and x c (fc) and y c (k) 
are the mean-centered methylation profiles x c (fc) = x(fc) — Ay* and y c (k ) = 
y(/c) — fiy ' 1 . The DNAm network interaction measure hence evaluates the 
extent to which, in an individual tumour sample, the combinations of the 
methylation-variables (i.e., loci) in genes X and Y explain each other, or 
covary, in the spaces determined by CCA on corresponding healthy sam¬ 
ples; that is, the covariation in tumour sample k between the methylation- 
variables in genes X and Y is assessed against typical healthy variability 
in these variables. When the DNA methylation network interaction measure 
Pxy(k ) is large (i.e., close to 1), the corresponding pair of genes explain each 
others’ gene-regulatory behaviour (as reflected in their methylation profiles) 
well, or have otherwise well-correlated interactive or associative behaviour, 
for sample/patient k. Hence, pxy(k) measures (according to their DNA 
methylation profiles) the level of interaction or association between genes X 
and Y in tumour sample k, compared to typical interactions between these 
genes in healthy tissue. 
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— High level of network interaction, P xy = 1 

— Low level of network interaction, = 0.07 

— Mean healthy methylation profile 
Healthy population profiles 

O Network node 
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Fig 2: The DNA methylation network interaction measure. 

A combination of the variation of the healthy methylation profiles in regions (a) and 
(b) of gene X explains well / is well-explained by a combination of the variation of 
the healthy methylation profiles in regions (c) and (d) of gene Y. The green cancer 
sample varies by a large amount about the mean methylation profile and in a typical 
way in these regions in both genes. Hence, the green sample corresponds to a high 
level of network interaction for this sample, pxy = 1 • The equivalent variations in 
the other regions of these genes do not explain each other well, and so the red sample, 
which varies by a large amount in these other regions and varies less and in an 
atypical way in regions (a) - (d), corresponds to a low level of network interaction, 
Pxy = 0.07. Genes X and Y are likely to have different numbers of methylation 
measurement locations (i.e., variables X and Y are of different dimension). The 
ordering of the measurement locations has no influence on the calculation of p, 
as long as the ordering is consistent across samples. This diagram was presented 
previously by Bartlett, Olhede and Zaikin (201)). 
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2.2. Prognostic Network Construction. Our proposed methodology for 
inference of network oncomarkers is based on a prognostic interaction net¬ 
work over m genes. This network is represented by the m x m adjacency 
matrix A, in which an edge is defined to to be present (i.e., A t] = 1) if 
and only if the corresponding pair of genes (nodes) are prognostic according 
to the DNA methylation network interaction measure of Section 2.1. Oth¬ 
erwise, we set A t j = 0. We note that i and j are now redefined compared 
to the last Section, so that they index genes rather than DNA methylation 
locations. This formulation will not be problematic, because all subsequent 
analysis is carried out at the level of genes rather than DNA methylation 
locations. To identify prognostic edges, we use the Cox proportional hazards 
model (Cox, 1972) to calculate a Wald-statistic Zij for each of the ) pairs 
of genes in the network. The Wald statistic quantifies the strength of associ¬ 
ation of the DNA methylation network interaction measure pij for the pair 
of genes i and j (i = 1 , m and j = 1 ,..., m) with patient survival outcome 
across patients k ( k = 1, ...n). We use a multivariate Cox model, adjusting 
these Wald statistics for clinical covariates, fitting this model separately to 
each pair of genes ( i,j ). We adjust in this way in order to detect novel DNA 
methylation biomarkers which are independent of known prognostic clinical 
features. 

The Wald statistic is asymptotically normally distributed with unit vari¬ 
ance (Harrell, 2001), and we can therefore model the distribution of our 
observed Wald statistics, Zij, as a mixture of Gaussians. We have previously 
demonstrated the utility of mixture modelling to a related network infer¬ 
ence problem (Bartlett, 2015), and a similar approach can be applied in this 
context. We model the as a Gaussian mixture as follows: 


( 6 ) 


Zij ~ 


if Aij = 1, 
A7(0,<t 2 ), if Aij = 0, 


where M ( Pij,<J 2 ) is the normal distribution, and we enforce a 2 = 1 in line 
with the asymptotic behaviour of the Wald statistic. We fit this mixture 
model to each observed statistic Zij, and then infer whether, given Zij, it is 
more likely that p Vj = 0, or /j n j ^ 0, leading to the estimates Ajj = 0 or 
Aij = 1 respectively. We fit this model using the empirical Bayes procedure 
of Johnstone and Silverman (2004), defining a mixture prior distribution 
/prior over the pij of equation 6: 


( 7 ) 


/prior (Pij') — (1 ^) $ (/hj) T (Pij ) > 


where w is the mixing parameter between the two components, which can 
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also be interpreted as w = E \p(Aij = 1 )], and 7 (-|a) is the Laplace proba¬ 
bility density function, 


7 


a 

2 




where we use the standard value of a = 0.5 (Johnstone and Silverman, 2004). 
Taking the mixture components to have Gaussian likelihoods, fj 7 (•[^,y,a 2 ), 
as in equation 6 , it follows from equation 7 that the posterior density over 
the observed prognostic Wald statistic is: 

( 8 ) 


/posterior \ Zij ) — 


(1 -w) 5 (fijj) fjsf (zjj |0, a 2 ) + W~f (fijj) (zij\fiij,a 2 ) 

/marginal ) 


where the marginal density is: 


(9) /marginal (~ij) — (1 ^)//V (^ij |0> ^ ) 4“ ^9 (Zij) j 

where g {/J-ij) is the convolution of the Laplace density with the standard 
normal density. If the Laplace distribution in the prior (equation 7) were 
replaced with a Gaussian, then the marginal distribution (equation 9) would 
be a mixture of Gaussians. However, as noted previously (Johnstone and 
Silverman, 2004), this empirical Bayes procedure requires a prior with tails 
that are exponential or heavier. Hence, we similarly use the Laplace rather 
than Gaussian prior which is a slight model misspecification. 

Although a separate model is fitted to each observed Wald statistic Zij, 
a common weight Wi is used for each gene/node i. We choose to do this, 
because estimating Wi separately for each gene i allows adaptation to a 
heterogenous degree distribution in A, as follows. For a particular gene i, if 
the are mostly close to zero, then Wi will be set low, which means that 
fewer edges {A^ = 1 ) will be detected; this hence corresponds to i being a 
low-degree node. If for a different gene i the Zjj are generally further from 
zero, then w l will be set high, which corresponds more edges being detected; 
this hence corresponds to i being a high-degree node. 

The estimate Wi is found as the value which maximises the marginal 
likelihood (equation 10 ) of the observed statistics z t j over all the pairwise 
comparisons of i with j, j ^ i. This allows the model for each such pairwise 
comparison (i,j) to ‘borrow strength’ from all the other comparisons (■ i,j'), 

j' + h j' j- 


Wi = arg max log {(1 - w)4> (%) + wg (z lJ )} . 

W Z -' 


( 10 ) 
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As in the original presentation of this methodology (Johnstone and Silver- 
man, 2004), we use the posterior median to obtain the estimate pij. Then 
we make a conservative estimate of A as follows: 

(11) Aij =1 if fiij > 0 and fiji >0 or fpj < 0 and fiji < 0, 

Ajj =0 otherwise. 

2.3. Community and Oncomarker Detection. Network nodes can be grouped 
together according to their propensity to interact with each other, for exam¬ 
ple groups of friends in a social network, or functional subnetwork modules 
in a biological network; this method is referred to as community detection 
(Girvan and Newman, 2002; Newman, 2004). We use community detection 
to naturally infer groups of genes in our constructed prognostic network. 
These groups of genes interact differently in cancer than in healthy tissue, 
in a way which is predictive of how advanced the disease is. We term these 
groups ‘network community oncomarkers’. Within a network community on¬ 
comarker the genes may interact with each other more (relative to healthy 
tissue) the more serious the disease is (as in Figure 6c), or they may interact 
with each other less the more serious the disease is, (as in Figure 6a). We 
carry out the task of community detection by fitting the degree-corrected 
stochastic blockmodel (Holland, Laskey and Leinhardt, 1983; Bickel and 
Chen, 2009). We fit this model in an efficient way by regularised spectral 
clustering (Qin and Rohe, 2013), calculating the optimum number of com¬ 
munities to divide the network into by the network histogram method (01- 
hede and Wolfe, 2014). Each community identified in this way represents a 
potential network community oncomarker. 

For each network community oncomarker, we then calculate a prognostic 
score for each patient, by summarising the DNA methylation network inter¬ 
action measure over this group of genes. This prognostic score can be used 
as a one-number summary of disease prognosis for that patient according to 
that network community oncomarker. The following points are important 
when calculating these summaries. Some gene-gene interactions will cor¬ 
respond to an increasingly negative DNA methylation network interaction 
measure p t3 for worse patient prognosis. On the other hand, some gene-gene 
interactions will correspond to an increasingly positive pij for worse prog¬ 
nosis. This means that care must be taken when summarising the network 
interaction measure across the network community oncomarker. Also, for 
the same amount of prognostic information conveyed, the magnitude of the 
changes in the network interaction measure may not be the same for each 
prognostic pairs of genes. To address these points, we combine the across 


12 


T.E. BARTLETT ET AL 


the prognostic pairs of genes of the network community after first multi¬ 
plying them by the corresponding fitted Cox proportional hazards model 
coefficients Oj-j , obtained as described at the start of Section 2.2. Under the 
Cox proportional hazards model, the fitted model coefficient 6ij for a pre¬ 
dictor ij gives the log of the hazard ratio (HR) for that predictor in the 
model, that is, log (HRjj) = 6ij. The hazard ratio is the scale-factor increase 
in probability of an event (e.g., death) occurring per unit time, relative to 
the baseline hazard (e.g., compared to a control group). Hence, these coeffi¬ 
cients are interpretable in the same way, without scaling issues, across fitted 
models. This means that, for patient k, we can combine the DNA methy- 
lation network interaction measures over a network community oncomarker 
to generate a one-number prognostic score, as follows: 

Score*, = ^2 A ij d ij p ij (k), 

ieC,jeC,i<j 


where C is the set of nodes in the network community oncomarker, A is 
the inferred adjacency matrix, Pij(k) is the DNA methylation network in¬ 
teraction measure for genes/nodes i and j and patient k, and B V] is the cor¬ 
responding fitted Cox multivariate proportional-hazards model coefficient. 
Network edges/DNA methylation network interaction measures ptj which 
increase with poor prognosis (i.e., pairs of genes which interact more as the 
disease progresses, coloured green in Figure 6), will correspond to &ij > 0. 
Hence, an increase in such a p t j will increase the prognostic score. Equiv¬ 
alently, network edges/DNA methylation network interaction measures pij 
which decrease with poor prognosis (i.e., pairs of genes which interact less as 
the disease progresses, coloured red in Figure 6), will correspond to 0 tJ < 0. 
Hence, a decrease in such a pij will also increase the prognostic score. 

2.4. An equivalent gene-expression interaction measure. To examine fur¬ 
ther the hypothesis that the DNA methylation network interaction mea¬ 
sure is a reflection of co-regulatory or co-regulated gene-expression patterns 
(among other genomic effects), we need an equivalent measure of gene-gene 
interaction or association in terms of gene expression. We can calculate such 
a measure, p^y (k), for gene expression measurements x expv (k ) and y expr (k) 
for the genes X and Y and patient k, as follows (equation 12): 

(x expr (k) - (y expr (fc) - a!?U 

PXY W - {h) (h) 

U ^expr U ^expr 


( 12 ) 
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where 

AS$U = — X x expr (k) and 4el Pr = — X ?/ eXPr ( fc )’ 

^ /cGhealthy ^ fcGhealthy 

(cr<ih) 2 = ^- E aSU) 2 

^ /cG healthy 

and 

(d-h) 2 = ^ E (»“■”(*)-a'51.) 2 . 

fcG healthy 

The intuition of equation 12 is that when the gene expression measurements 
x expr (fc) and y expr (k) deviate in the same sample from the corresponding 
healthy mean expression levels, this measure will be nonzero. When this oc¬ 
curs in the same samples as the DNA methylation network interaction mea¬ 
sure pxv(k ) is also nonzero, we will see a correlation between pxy(k ) and 
P & xy ■ These interaction measures for methylation and expression, pxy(k) 
and p°xyi are equivalent because they both measure deviation from typical 
interactive behaviour in healthy/control samples. 

3. Examples. We present an example application of the methodology 
proposed in Section 2 to a large publicly available breast cancer invasive 
carcinoma (BRCA) dataset downloaded from the Cancer Genome Atlas 
(TCGA). We downloaded an initial batch of DNA methylation data for 
tumour samples taken from 175 individuals (the training set), together with 
clinical data for these samples relating to patient survival outcome, and the 
covariates age, disease stage, and residual disease. These training data were 
used to detect potential network community oncomarkers. We then down¬ 
loaded DNA methylation data for a further 528 tumour samples (the test 
set), together with data for the same clinical features: these independent 
samples were used to validate the potential network community oncomark¬ 
ers. We also downloaded corresponding DNA methylation data for healthy 
breast tissue samples from 98 individuals to form a reference population of 
DNA methylation profiles for this analysis, and we downloaded gene expres¬ 
sion data for 216 of the tumours for which DNA methylation data were also 
available. To proceed, we estimated from the training set the healthy pop¬ 
ulation means, covariances and cross-covariances required to calculate the 
Pij (i = 1, ...,m and j = 1, as well as the corresponding log hazard 

ratios dij and adjacency matrix A. Additionally from the training data we 
estimated the communities in the adjacency matrix (including the number 
of communities) and prognostic score thresholds used to assign patients to 
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better and worse prognostic groups. We then used these estimates to verify 
the prognostic ability of the methodology in the test set. 

We first inferred the binary prognostic adjacency matrix A for the 175 
samples of the BRCA training data set according to the methods set out 
in Sections 2.1 - 2.2. DNA methylation data were available for 14829 genes, 
and hence the number of nodes/genes m in the inferred adjacency matrix A 
is m = 14829. The presence of an edge in A, that is, A tJ = 1, indicates that 
the interaction between genes i and j is associated with disease progression. 
The edge density of A is 0.0035, that is, p(Aij = 1) = 0.0035. We then 
extracted the connected component from this inferred network and carried 
out community detection on this connected component as described in Sec¬ 
tion 2.3. This resulted in 33 communities ranging from 116 to 285 nodes 
in size. The reduced adjacency matrix relating to these communities [with 
m = 5668 and p{Aij = 1) = 0.023] is shown in Figure 3. We note that the 
stochastic blockmodel, fitted in this way via spectral clustering, does not 
provide any uncertainty as to the inferred community assignments: if this 
is desired, then mixed-membership stochastic blockmodels are available as 
an alternative (Airoldi et ah, 2008). In the analysis we present here, uncer¬ 
tainties arising from these inferred community assignments are considered 
in the subsequent analyses (Figures 4 and 5 and Tables 1 and 2). 

We validated each of the 33 potential network community oncomarkers 
in the 528 independent tumour samples of the test/validation set. We note 
that these 528 samples were not used in any way to identify the 33 po¬ 
tential network community oncomarkers shown in Figure 3. Hence in this 
validation each of these 528 patients were classified individually according 
to prognosis without reference to the other validation samples. This means 
that comparing these prognostic classifications assigned to the validation 
samples is a true test of prognostic ability of the network community onco¬ 
markers. To carry out the validation, we calculated the prognostic score for 
the 528 independent/unseen samples of the test set, based on the inferred 
prognostic adjacency matrix A and the fitted Cox multivariate proportional 
hazards model coefficients 6 obtained from the initial 175 samples of the 
training set. Using this trained model, we calculated one prognostic score 
for each potential network community oncomarker for each of the 528 unseen 
test-set samples. We then tested the prognostic score, for each potential net¬ 
work community oncomarker, for significant prediction of patient survival 
outcome in these 528 unseen test-set samples. The five potential network 
community oncomarkers which validated in this way with the highest level 
of significance are outlined in red in Figure 3. The results of univariate 
and multivariate Cox regression for these five best network community on- 
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Fig 3: The inferred prognostic adjacency matrix after community detection. 
Entries in the adjacency matrix equal to 1 (representing a network edge) are coloured 
blue. Detected communities are outlined in black. The potential network community 
oncomarkers which are analysed further in Figures j - 1 and Tables 1-2 and Tables 
SI - S5 in the supplement are outlined in red, and labelled (a) - (e). 
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Fig f: Network community oncomarkers: Kaplan-Meier plots for the training set. 
Comparison of survival curves for the patient groups defined by the prognostic score 
for each network community oncomarker. The groups are divided by the median 
prognostic score in the 175 samples of the training data set. The hazard ratio (HR) 
is displayed with 95% C.I. in brackets, with the corresponding p-value calculated by 
univariate Cox regression, (a) - (e) indicate network community oncomarkers 1 - 
5, as shown in Figure 3. 
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Fig 5: Network community oncomarkers: Kaplan-Meier plots for the test set. 
Comparison of survival curves for the patient groups defined by the prognostic score 
for each network community oncomarker. The groups are divided by the median 
prognostic score in the 175 samples of the training data set. The hazard ratio (HR) 
is displayed with 95% C.I. in brackets, with the corresponding p-value calculated by 
univariate Cox regression, (a) - (e) indicate network community oncomarkers 1 - 
4, as shown in Figure 3. 
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comarkers are shown in Figures 4 and 5, and in Tables 1 and 2, for the 
training and test sets respectively. Plots equivalent to Figures 4 and 5 for all 
33 detected network communities appear in Supplementary Figures S1-S2. 
For the multivariate analysis, samples with missing data for any of the clin¬ 
ical covariates were removed, leaving 172 and 396 samples for the training 
and test sets respectively. We note that, as would be expected, the level of 
significance in the training set (to which the model was fitted, Figure 4 and 
Tablel), is much higher than in the test set (Figure 5 and Table 2). 

Figure 6 shows the five network community oncomarkers which validated 
most significantly. Green edges indicate gene-gene interactions which become 
stronger with disease progression. Red edges indicate interactions which be¬ 
come weaker with disease progression. Hence, the network community onco¬ 
markers of Figures 6a and 6b can be considered to be functional subnetwork 
modules which become less active as the cancer progresses (comprised of 
99% and 96% red edges, respectively). On the other hand, Figures 6c and 
6 d can be considered to be functional subnetwork modules which become 
more active as the cancer progresses (both comprised of 99% green edges). 
Then the network community oncomarker of Figure 6e contains a mixture of 
these effects (comprised of 87% red and 13% green edges). However, each of 
these network community oncomarkers represents a functional subnetwork 
module which is rewired in a way which is advantageous for the cancer, in 
favour of proliferation, and against cell death and immune function. The 
genes/nodes of these network community oncomarkers are shown in Tables 
SI - S5 in the supplement; they list many genes related to cell proliferation 
(e.g., CDKL1 , NKAPL, MAPK6 ), developmental processes (e.g., HOXDIO, 
HOXB9, HOXCIO , H OX A13, HOXC12 , HOXD13), and immune function 
(e.g., VSIG2, IL36B , RBPJ). 

We hypothesise that the DNA methylation network interaction measure is 
a reflection of co-regulatory or co-regulated gene-expression patterns, among 
other genomic effects. We tested this hypothesis by comparing the DNA 
methylation network interaction measure pxy f° r a pair of genes XY (equa¬ 
tion 2) with an equivalent measure of interactive behaviour of these genes in 
terms of their expression levels, p e £y (equation 12). Correlation test p- values 
for the comparison between pxy and p^y appear in Figure 7. It is clear that 
in these histograms, there is a concentration of significant p-values close to 
zero, indicating a departure from the null hypothesis uniform distribution, 
and demonstrating an association between pxY and p C yy for many of the 
edges/interactions of each network community oncomarker. However, there 
are also many nonsignificant p-values visible in these histograms, indicat¬ 
ing that there are other genomic interactive effects present which cannot be 
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HR (95%CI) 

P 

n 

Prognostic Score 

77.1 (10.5-567) 

<0.001 

172 

Age 

1.79 (0.66-4.84) 

0.249 

172 

Residual Disease 

15.4 (4.68-50.9) 

<0.001 

172 

Stage 

2.85 (0.96-8.46) 

0.060 

172 


(a) Network community oncomarker 1. 



HR (95%CI) 

P 

n 

Prognostic Score 

51.3 (8.35-315) 

<0.001 

172 

Age 

1.42 (0.48-4.23) 

0.53 

172 

Residual Disease 

30.4 (5.82-158) 

<0.001 

172 

Stage 

1.95 (0.68-5.54) 

0.212 

172 


(b) Network community oncomarker 2. 



HR (95%CI) 

P 

n 

Prognostic Score 

50.1 (9.77-256) 

<0.001 

172 

Age 

2.16 (0.81-5.8) 

0.125 

172 

Residual Disease 

13.3 (4.54-39.1) 

<0.001 

172 

Stage 

2.41 (0.81-7.18) 

0.114 

172 


(c) Network community oncomarker 3. 



HR (95%CI) 

P 

n 

Prognostic Score 

22.7 (5.52-93.1) 

<0.001 

172 

Age 

3.49 (1.3-9.42) 

0.0135 

172 

Residual Disease 

16.3 (5.24-50.7) 

<0.001 

172 

Stage 

1.05 (0.38-2.91) 

0.928 

172 


(d) Network community oncomarker 4. 



HR (95%CI) 

P 

n 

Prognostic Score 

46.0 (8.17-259) 

<0.001 

172 

Age 

2.91 (1-8.44) 

0.0493 

172 

Residual Disease 

7.04 (2.68-18.5) 

<0.001 

172 

Stage 

3.74 (1.23-11.4) 

0.02 

172 


(e) Network community oncomarker 5. 


Table 1 

Network community oncomarkers - training set prognosis. 

Multivariate Cox regression was used to test significance of the prognostic scores obtained 
from the network community oncomarkers. (a) - (e) indicate network community 
oncomarkers 1 - 5, as shown in Figure 3. 
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HR (95%CI) 

P 

n 

Prognostic Score 

4.89 (1.65-14.5) 

0.00429 

396 

Age 

3.52 (1.46-8.49) 

0.00513 

396 

Residual Disease 

12.5 (5.32-29.3) 

<0.001 

396 

Stage 

1.62 (0.66-4) 

0.294 

396 


(a) Network community oncomarker 1. 



HR (95%CI) 

P 

n 

Prognostic Score 

5.07 (1.81-14.1) 

0.00195 

396 

Age 

3.67 (1.49-9.03) 

0.00458 

396 

Residual Disease 

8.72 (3.78-20.1) 

<0.001 

396 

Stage 

1.47 (0.6-3.61) 

0.406 

396 


(b) Network community oncomarker 2. 



HR (95%CI) 

P 

n 

Prognostic Score 

2.63 (1.01-6.89) 

0.0484 

396 

Age 

2.07 (0.86-5) 

0.106 

396 

Residual Disease 

11.3 (4.97-25.5) 

<0.001 

396 

Stage 

2.04 (0.76-5.45) 

0.157 

396 


(c) Network community oncomarker 3. 



HR (95%CI) 

P 

n 

Prognostic Score 

4.92 (1.8-13.5) 

0.00189 

396 

Age 

1.91 (0.78-4.69) 

0.159 

396 

Residual Disease 

17.2 (6.76-43.9) 

<0.001 

396 

Stage 

0.92 (0.34-2.48) 

0.871 

396 


(d) Network community oncomarker 4. 



HR (95%CI) 

P 

n 

Prognostic Score 

2.5 (0.94-6.65) 

0.0668 

396 

Age 

2.23 (0.94-5.27) 

0.0677 

396 

Residual Disease 

8.17 (3.47-19.3) 

<0.001 

396 

Stage 

1.59 (0.64-3.95) 

0.321 

396 


(e) Network community oncomarker 5. 


Table 2 

Network community oncomarkers - test/validation set prognosis. 

Multivariate Cox regression was used to test significance of the prognostic scores obtained 
from the network community oncomarkers. (a) - (e) indicate network community 
oncomarkers 1 - 5, as shown in Figure 3. 
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Network interaction stronger 
with worse prognosis 

Network interaction weaker 
with worse prognosis 

• Gene / network node 


Fig 6: Detected network community oncomarkers. 

(a) - (e) indicate network community oncomarkers 1 - 5, as shown in Figure 3. 
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Fig 7: Correlation of DNA methylation with gene expression for the network com¬ 
munity oncomarkers. 

(a) - (e) indicate network community oncomarkers 1 - 5, as shown in Figure 3. 
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explained in terms of gene expression (as assessed by mRNA levels) alone. 
Such effects are expected to include the influence of alternatively spliced 
products or isoforms (Jones, 2012) and the interaction between noncoding 
transcripts and the epigenome (Lai and Shiekhattar, 2014). 

4. Discussion. In this paper, we have proposed methodology to de¬ 
tect cancer biomarkers based on the epigenomic pattern DNA methylation. 
This methodology builds on a previously proposed measure of pairwise in¬ 
teraction between genes, based on the epigenomic gene-regulatory pattern 
DNA methylation (Bartlett, Olhede and Zaikin, 2014). Based on this DNA 
methylation network interaction measure, the methodology we describe in 
this paper allows inference of prognostic genomic networks, and identifica¬ 
tion of prognostic biomarkers from such networks using community detec¬ 
tion methodology. Community detection has previously proved powerful as 
well realistic in a range of fields, including social as well as biological net¬ 
works (Girvan and Newman, 2002). In the context of genomic networks, such 
modular groups of genes are known to correspond to specific physiological 
functions (Shen-Orr et ah, 2002). The modular prognostic biomarkers which 
we detect are termed ‘network community oncomarkers’; they are groups of 
nodes/genes among which there is a high density of prognostic genomic in¬ 
teractive or associative behaviour. We have demonstrated that within these 
communities, the DNA methylation network interaction measure is highly 
associated with co-regulatory behaviour linked to gene expression (at the 
mRNA level), giving functional relevance to the findings. However, there 
are also likely to be a range of genomic interactive effects present which are 
measured by the DNA methylation network interaction measure but which 
are not reflected in mRNA levels. Our proposed methodology also allows 
a one-number prognostic score for a network community oncomarker to be 
calculated for each patient/sample: this prognostic score is a measure of 
disease progression in that patient. 

Our proposed methodology uses mixture modelling to infer network struc¬ 
ture from prognostic association between genes, and draws on practical ap¬ 
proaches to community detection to obtain oncomarkers from this prognostic 
network. Mixture modelling has previously been shown to be an effective ap¬ 
proach to the related problem of clustering in networks (Vu et al., 2013). 
This suggests that more general methodology could be developed here, in 
which network and community inference are both carried out simultaneously 
by model fitting. Network inference has also been carried out previously us¬ 
ing multiple node attributes in cell biological data (Katenka et ah, 2012), 
and those findings could be used as a basis upon which data from other 
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genomic sources could be integrated into the methodology proposed here. 
Genes also frequently carry out multiple roles in different biological contexts 
and hence may be involved in more than one functional subnetwork mod¬ 
ule within a genomic network. Work has been carried out on overlapping 
stochastic blockmodels (Latouche et al., 2011), and hence this would be a 
natural context in which to develop an application for such methodology. 

The field of epigenomics is progressing fast and promises many new in¬ 
sights in the near future into unexplained or undiscovered genomic phe¬ 
nomena, for example relating to the so-called ‘dark matter’ of the genome 
(Venters and Pugh, 2013). Epigenomics is also expected to provide new un¬ 
derstanding of the mechanisms of disease progression. The discovery that 
some genomic loci gain or lose methylation in ways which may be unique to 
cancer suggests that understanding changes in DNA methylation machinery 
may be essential to understanding oncogenesis (Xie et al., 2013). The field 
of network science is also advancing rapidly. Networks are an efficient way to 
represent and analyse large numbers of variables, which is particularly rel¬ 
evant in modern, large-scale genomic studies. Networks of interactions are 
a natural way to represent and analyse genomic interactions, associations 
and processes. Therefore, the study of genomic and epigenomic networks 
promises to be productive over the coming years for the fields of biology, 
medicine, and statistics. 

5. Datasets. DNA methylation (DNAm) data from breast cancer in¬ 
vasive carcinoma (BRCA) tumour samples, collected via the Illumina In- 
finium HumanMethylation450 platform, were downloaded from The Cancer 
Genome Atlas (TCGA) project (Hampton, 2006; Bonetta, 2006; Collins and 
Barker, 2007) at level 3. These data were preprocessed by first removing 
probes with nonunique mappings and which map to SNPs (as identified in 
the TCGA level 3 data); probes mapping to sex chromosomes were also re¬ 
moved; in total 98384 probes were removed in this way from all data sets. 
After removal of these probes, 270985 probes with known gene annotations 
remained. Probes were then removed if they had less than 95% coverage 
across samples; probe values were also replaced if they had corresponding 
detection p-value greater than 5%, by KNN (k nearest neighbour) imputa¬ 
tion (k = 5). The loci of analysed CpGs were mapped to genes based on 
annotation information for the Illumina Infmium platform obtained from 
the R / Bioconductor package TlluminaHumanMethylation450k’. The data 
were also checked for batch effects by hierarchical clustering and correlation 
of the significant principle components with phenotype and batch: no sig¬ 
nificant batch effects (which would warrant further correction) were found. 
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We downloaded DNA methylation data for tumour samples from 175 sam¬ 
ples/individuals, from TCGA in July 2013, with clinical data available for 
patient survival outcome, and the clinical covariates age, disease stage, and 
residual disease. At the same time, we also downloaded corresponding DNA 
methylation data for healthy tissue for 98 individuals. These data were used 
to detect potential network community oncomarkers. We then downloaded 
DNA methylation data for a further 528 tumour samples from TCGA in 
September 2014, with data for the same clinical features available. These in¬ 
dependent samples were used to validate the potential network community 
oncomarkers. At this time we also downloaded gene expression data from 
TCGA at level 3, for 216 of the tumours for which we also obtained DNA 
methylation data. 
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